Takehome_Ex03

Take-home Exercise 3: Predicting HDB Public Housing Resale Pricies using Geographically Weighted Methods

Background

Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.

Objective

To predict HDB resale prices at the sub-market level for the month of January and February 2023 in Singapore.The predictive models will be built by using by using conventional OLS method and GWR methods. The objective is to compare the performance of the conventional OLS method versus the geographical weighted methods.

Concept for GWR, Hedonic pricing model

Geographically weighted regression (GWR) is a spatial statistical technique that takes non-stationary variables into consideration (e.g., climate; demographic factors; physical environment characteristics)

In this take home assignment, we need to build predictive model using GWR. The predictive model need to take consideration into locational factors such as proximity to childcare centre, public transport service and shopping centre.

Hence, we have to first identify the relevant location factors for this assignment, which including:

  • Proximity to CBD

  • Proximity to eldercare

  • Proximity to foodcourt/hawker centres

  • Proximity to MRT

  • Proximity to park

  • Proximity to good primary school

  • Proximity to shopping mall

  • Proximity to supermarket

  • Numbers of kindergartens within 350m

  • Numbers of childcare centres within 350m

  • Numbers of bus stop within 350m

  • Numbers of primary school within 1km

Steps

Load Necessary Library

pacman::p_load(olsrr, corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, gtsummary,readxl,jsonlite,rvest)

Importing geospatial data (MP2019)

mpsz2019 = st_read(dsn = "data/geospatial", layer = "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

Updating CRS information

mpsz_svy21 <- st_transform(mpsz2019, 3414)

check whether the CRS has correctly assigned:

st_crs(mpsz_svy21)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Check invalid geometries

length(which(st_is_valid(mpsz_svy21) == FALSE))
[1] 6

Make the invalid geometries valid

mpsz_svy21 <- st_make_valid(mpsz_svy21)
length(which(st_is_valid(mpsz_svy21) == FALSE))
[1] 0

Aspatial Data Wrangling

Importing the aspatial data

resale = read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
Rows: 148277 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We plan to look at four-room flat and a transaction period between1st January 2021 to 31st December 2022.

resale <- resale %>% 
  filter(flat_type == "4 ROOM") 
resale
# A tibble: 61,961 × 11
   month   town    flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
   <chr>   <chr>   <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
 1 2017-01 ANG MO… 4 ROOM  472   ANG MO… 10 TO …      92 New Ge…    1979 61 yea…
 2 2017-01 ANG MO… 4 ROOM  475   ANG MO… 07 TO …      91 New Ge…    1979 61 yea…
 3 2017-01 ANG MO… 4 ROOM  629   ANG MO… 01 TO …      94 New Ge…    1981 63 yea…
 4 2017-01 ANG MO… 4 ROOM  546   ANG MO… 01 TO …      92 New Ge…    1981 63 yea…
 5 2017-01 ANG MO… 4 ROOM  131   ANG MO… 01 TO …      98 New Ge…    1979 61 yea…
 6 2017-01 ANG MO… 4 ROOM  254   ANG MO… 04 TO …      97 New Ge…    1977 59 yea…
 7 2017-01 ANG MO… 4 ROOM  470   ANG MO… 04 TO …      92 New Ge…    1979 61 yea…
 8 2017-01 ANG MO… 4 ROOM  601   ANG MO… 04 TO …      91 New Ge…    1980 62 yea…
 9 2017-01 ANG MO… 4 ROOM  463   ANG MO… 04 TO …      92 New Ge…    1980 62 yea…
10 2017-01 ANG MO… 4 ROOM  207   ANG MO… 04 TO …      97 New Ge…    1976 58 yea…
# … with 61,951 more rows, 1 more variable: resale_price <dbl>, and abbreviated
#   variable names ¹​flat_type, ²​street_name, ³​storey_range, ⁴​floor_area_sqm,
#   ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

However, when we look at the code, there is no Longitude and Latitude Information, which means we need to geocode it, to make sure we get accurate geocode result, we have to combine the Block and Street Name as input:

In OneMapSG API, “ST.” is usually written as “SAINT” instead, so it is better for us to replace ST. with SAINT

resale$street_name <- gsub("ST\\.", "SAINT", resale$street_name)

Here is the code for geocoding the aspatial data: (Reference was taken from the senior sample submissions for the code for this section, with credit to MEGAN SIM TZE YEN)

# 
# library(httr)
# geocode <- function(block, streetname) {
#   base_url <- "https://developers.onemap.sg/commonapi/search"
#   address <- paste(block, streetname, sep = " ")
#   query <- list("searchVal" = address, 
#                 "returnGeom" = "Y",
#                 "getAddrDetails" = "N",
#                 "pageNum" = "1")
#   
#   res <- GET(base_url, query = query)
#   restext<-content(res, as="text")
#   
#   output <- fromJSON(restext)  %>% 
#     as.data.frame %>%
#     select(results.LATITUDE, results.LONGITUDE)
# 
#   return(output)
# }
# resale$LATITUDE <- 0
# resale$LONGITUDE <- 0
# 
# for (i in 1:nrow(resale)){
#   temp_output <- geocode(resale[i, 4], resale[i, 5])
#   
#   resale$LATITUDE[i] <- temp_output$results.LATITUDE
#   resale$LONGITUDE[i] <- temp_output$results.LONGITUDE}

Make a copy for the geocoded data:

Make a copy for the geocoded data so we don’t need to run the previous process again:

#write_csv(resale, "data/rds/resalecopy1.csv")

Read the geocoded data

resale = read_csv("data/rds/resalecopy1.csv")
Rows: 23656 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (5): floor_area_sqm, lease_commence_date, resale_price, LATITUDE, LONGITUDE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Update CRS

resale.sf <- st_as_sf(resale,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

Structural Factors

FLOOR LEVEL

unique(resale$storey_range)
 [1] "04 TO 06" "01 TO 03" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "28 TO 30" "25 TO 27" "34 TO 36" "31 TO 33"
[13] "43 TO 45" "37 TO 39" "40 TO 42" "46 TO 48" "49 TO 51"

We can observe that there are 17 categories of storey ranges. To represent each range, we will use the pivot_wider() function to create duplicate variables, assigning a value of 1 if the observation falls within the range and 0 if it does not.

resale <- resale %>%
  pivot_wider(names_from = storey_range, values_from = storey_range, 
              values_fn = list(storey_range = ~1), values_fill = 0) 

REMAINING LEASE

The current format of the remaining_lease variable is a string, but we need it to be in numeric format. To achieve this, we can split the string into its month and year values, calculate the remaining lease in years, and replace the original string values with the calculated numeric values.

str_list <- str_split(resale$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      resale$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale$remaining_lease[i] <- year
  }
}
resale_sf <- st_as_sf(resale, 
                      coords = c("LONGITUDE", 
                                 "LATITUDE"), 
                      crs=4326) %>%
  st_transform(crs = 3414)

Locational Factor

CBD

As the proximity to CBD is one of the locational factor we interested in to improve our predicted model, let’s take the coordinates of Downtown core to be the coordinates of CBD

lat <- 1.287953
lng <- 103.851784

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
  st_transform(crs=3414)

Eldercare

Eldercare_sf <- st_read(dsn = "data/geospatial/eldercare-services-shp", 
                layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\eldercare-services-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
st_crs(Eldercare_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Eldercare_sf <- st_transform(Eldercare_sf, crs=3414)

Kindergartens

To get the Kindergartens’ geometry data, I refer to Megan’s method of using ONEMAP API to do the geocoding

The step including:

  1. Get the token from ONEMAP API
  2. Search the themes and check the themes available
  3. Using get_theme function to get the geospatial information of the Kindergarten
library(sf)
library(onemapsgapi)

token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjEwMDUyLCJ1c2VyX2lkIjoxMDA1MiwiZW1haWwiOiJ0YWtvdGFrb3YwMDBAZ21haWwuY29tIiwiZm9yZXZlciI6ZmFsc2UsImlzcyI6Imh0dHA6XC9cL29tMi5kZmUub25lbWFwLnNnXC9hcGlcL3YyXC91c2VyXC9zZXNzaW9uIiwiaWF0IjoxNjc5NTU4NDYxLCJleHAiOjE2Nzk5OTA0NjEsIm5iZiI6MTY3OTU1ODQ2MSwianRpIjoiZDgyZTg1MDMzMTUzMTRkZjEzYjk5MWJmMDJkMDQ1NjIifQ.070RoJrraz95GuLVvYZpfzyMyGWQZ6S0D5FsLL39WGU"
search_themes(token, "kindergartens") 
# A tibble: 1 × 5
  THEMENAME     QUERYNAME     ICON       CATEGORY  THEME_OWNER                  
  <chr>         <chr>         <chr>      <chr>     <chr>                        
1 Kindergartens kindergartens school.gif Education EARLY CHILDHOOD DEVELOPMENT …
Kindergartens_tibble <- get_theme(token, "kindergartens")
Kindergartens_sf <- st_as_sf(Kindergartens_tibble, coords=c("Lng", "Lat"), crs=4326)
Kindergartens_sf <- st_transform(Kindergartens_sf, crs=3414)

Childcare Center

search_themes(token, "childcare")
# A tibble: 1 × 5
  THEMENAME           QUERYNAME ICON                    CATEGORY THEME_OWNER    
  <chr>               <chr>     <chr>                   <chr>    <chr>          
1 Child Care Services childcare onemap-fc-childcare.png Family   EARLY CHILDHOO…
library(sf)
library(onemapsgapi)

themetibble <- get_theme(token, "childcare")
childcaresf <- st_as_sf(themetibble, coords=c("Lng", "Lat"), crs=4326)
childcaresf <- st_transform(childcaresf, crs=3414)

Park

search_themes(token, "parks")
# A tibble: 25 × 5
   THEMENAME                                       QUERY…¹ ICON  CATEG…² THEME…³
   <chr>                                           <chr>   <chr> <chr>   <chr>  
 1 NParks BBQ Pits                                 nparks… null  null    NATION…
 2 Tree Conservation Area                          tree_c… null  null    NATION…
 3 Singapore Coastal Habitats Map from High Resol… singap… null  null    NATION…
 4 Centralised Grasscutting Area                   centra… unti… Enviro… NATION…
 5 Nature Reserves Gazette 2005                    nr_gaz… gaze… Recrea… NATION…
 6 Heritage Trees                                  herita… tree… Enviro… NATION…
 7 NParks Activity Area                            nparks… null  null    NATION…
 8 Under-Construction Parks                        nparks… null  null    NATION…
 9 Prohibited Drone flying areas at NParks Nature… drone_… LOGG… Recrea… NATION…
10 Under-Construction Park Facilities              underc… null  null    NATION…
# … with 15 more rows, and abbreviated variable names ¹​QUERYNAME, ²​CATEGORY,
#   ³​THEME_OWNER
library(sf)
library(onemapsgapi)
Parks_themetibble <- get_theme(token, "nationalparks")
Parks_sf <- st_as_sf(Parks_themetibble, coords=c("Lng", "Lat"), crs=4326)
Parks_sf <- st_transform(Parks_sf, crs=3414)

Shopping mall

(Reference was taken from the senior sample submissions for the code for this section, with credit to NOR AISYAH BINTE AJIT) The approach is about extracting the names of the malls from Wikipedia and then use get_coords function to obtain the respective coordinates

The following code chunk performs three steps:

  • Reads the Wikipedia HTML page containing the Shopping Malls in Singapore.

  • Reads the text portion of the unordered list element selected by html_nodes().

  • Appends the extracted mall names to an empty mall_list.

get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}
url <- "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
malls_list <- list()

for (i in 2:7){
  malls <- read_html(url) %>%
    html_nodes(xpath = paste('//*[@id="mw-content-text"]/div[1]/div[',as.character(i),']/ul/li',sep="") ) %>%
    html_text()
  malls_list <- append(malls_list, malls)
}
library(httr)
malls_list_coords <- get_coords(malls_list) %>% 
  rename("mall_name" = "address")

Some of the names of the shopping malls in our dataset are not up-to-date, which can cause issues in the analysis. For example, POMO has been renamed to GR.ID and OD Mall has been renamed to The Grandstand. Hence, we should address these issues before proceeding. One of the shopping malls, Yew Tee Shopping Centre, was not found when researched further, so we should remove this row from our dataset

malls_list_coords <- subset(malls_list_coords, mall_name!= "Yew Tee Shopping Centre")
invalid_malls<- subset(malls_list_coords, is.na(malls_list_coords$postal))
invalid_malls_list <- unique(invalid_malls$mall_name)
corrected_malls <- c("Clarke Quay", "City Gate", "Raffles Holland V", "Knightsbridge", "Mustafa Centre", "GR.ID", "Shaw House",
                     "The Poiz Centre", "Velocity @ Novena Square", "Singapore Post Centre", "PLQ Mall", "KINEX", "The Grandstand")

for (i in 1:length(invalid_malls_list)) {
  malls_list_coords <- malls_list_coords %>% 
    mutate(mall_name = ifelse(as.character(mall_name) == invalid_malls_list[i], corrected_malls[i], as.character(mall_name)))
}

Then create a list to store unique names

malls_list <- sort(unique(malls_list_coords$mall_name))

To retrieve the coordinates of shopping malls

malls_coords <- get_coords(malls_list)
malls_coords[(is.na(malls_coords$postal) | is.na(malls_coords$latitude) | is.na(malls_coords$longitude)), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)
malls_sf <- st_as_sf(malls_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Primary School

pri_sch <- read_csv("data/geospatial/general-information-of-schools.csv")
Rows: 346 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (31): school_name, url_address, address, postal_code, telephone_no, tele...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pri_sch <- pri_sch %>%
  filter(mainlevel_code == "PRIMARY"| mainlevel_code == "MIXED LEVELS") %>%
  select(school_name, address, postal_code, mainlevel_code)
prisch_list <- sort(unique(pri_sch$postal_code))
prisch_coords <- get_coords(prisch_list)
prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude)), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)
prisch_coords = prisch_coords[c("postal","latitude", "longitude")]
pri_sch <- left_join(pri_sch, prisch_coords, by = c('postal_code' = 'postal'))
prisch_sf <- st_as_sf(pri_sch,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Select top 10 school

I refer to this website and check which are the exact top 10 primary schools and get their postal code, then I filtered them from the primary school dataset

postal_codes <- c(599986, 449761, 597610, 536451, 579767, 128806, 569405, 738907, 579646, 227988)

# Filter prisch_sf for rows with postal code in the vector
top10prisch_sf <- prisch_sf %>%
  filter(postal_code %in% postal_codes)
top10prisch_sf
Simple feature collection with 10 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19962.23 ymin: 31882.69 xmax: 36701.12 ymax: 47144.77
Projected CRS: SVY21 / Singapore TM
# A tibble: 10 × 5
   school_name                 address posta…¹ mainl…²            geometry
 * <chr>                       <chr>   <chr>   <chr>           <POINT [m]>
 1 ADMIRALTY PRIMARY SCHOOL    11   W… 738907  PRIMARY (24296.63 47144.77)
 2 AI TONG SCHOOL              100  B… 579646  PRIMARY (27966.81 38071.92)
 3 ANGLO-CHINESE SCHOOL (JUNI… 16   W… 227988  PRIMARY (28916.32 32403.75)
 4 CATHOLIC HIGH SCHOOL        9    B… 579767  MIXED … (29212.17 37386.95)
 5 CHIJ ST. NICHOLAS GIRLS' S… 501  A… 569405  MIXED … (28104.02 39497.61)
 6 HOLY INNOCENTS' PRIMARY SC… 5    L… 536451  PRIMARY  (34780.5 38781.92)
 7 METHODIST GIRLS' SCHOOL (P… 11   B… 599986  PRIMARY  (22443.8 35016.19)
 8 NAN HUA PRIMARY SCHOOL      30   J… 128806  PRIMARY (19962.23 33496.24)
 9 PEI HWA PRESBYTERIAN PRIMA… 7    P… 597610  PRIMARY  (21632.9 35581.06)
10 TAO NAN SCHOOL              49   M… 449761  PRIMARY (36701.12 31882.69)
# … with abbreviated variable names ¹​postal_code, ²​mainlevel_code

Hawker Center

hawker_sf <- st_read("data/geospatial/hawker-centres-kml.kml") 
Reading layer `HAWKERCENTRE' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\hawker-centres-kml.kml' 
  using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449217
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
hawker_sf <- st_transform(hawker_sf, crs=3414)
st_crs(hawker_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Bus Stop

bus_sf <- st_read(dsn="data/geospatial/BusStop_Feb2023", layer="BusStop")
Reading layer `BusStop' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\BusStop_Feb2023' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
bus_sf <- st_transform(bus_sf, crs=3414)

Mrt

rail_network_sf <- st_read(dsn="data/geospatial/TrainStation_Feb2023", layer="RapidTransitSystemStation")
Reading layer `RapidTransitSystemStation' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\TrainStation_Feb2023' 
  using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
GDAL Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
rail_network_sf<- st_transform(rail_network_sf, crs=3414)

SuperMarket

supermkt_sf <- st_read("data/geospatial/supermarkets-geojson.geojson")
Reading layer `supermarkets-geojson' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
supermkt_sf<- st_transform(supermkt_sf, crs=3414)

Check invalid geometry for locational factor

Bus stop

length(which(st_is_valid(bus_sf) == FALSE))
[1] 0

MRT

length(which(st_is_valid(rail_network_sf) == FALSE))
[1] 2

It got invalid geometry and here is the code to make it valid:

# # Identify the invalid geometry
# library(leaflet)
# invalid_geom <- which(!st_is_valid(rail_network_sf))
# 
# # Use st_cast() to convert the invalid geometry to a valid geometry type
# rail_network_sf <- st_cast(rail_network_sf, "MULTILINESTRING")
# 
# # Use shiny to manually edit the geometry
# library(shiny)
# shinyApp(
#   ui = fluidPage(
#     leafletOutput("map")
#   ),
#   server = function(input, output) {
#     output$map <- renderLeaflet({
#       leaflet(rail_network_sf) %>%
#         addTiles() %>%
#         addPolylines()
#     })
#     observe({
length(which(st_is_valid(rail_network_sf) == FALSE))
[1] 2
length(which(st_is_valid(childcaresf) == FALSE))
[1] 0

ElderlyCare

length(which(st_is_valid(Eldercare_sf) == FALSE))
[1] 0
length(which(st_is_valid(cbd_sf)))
[1] 1

CBD

cbd_sf <- st_make_valid(cbd_sf)
length(which(st_is_valid(cbd_sf) == FALSE))
[1] 0

Childcare

length(which(st_is_valid(childcaresf) == FALSE))
[1] 0

Kindergartens

length(which(st_is_valid(Kindergartens_sf) == FALSE))
[1] 0

Malls

length(which(st_is_valid(malls_sf) == FALSE))
[1] 0

Parks

length(which(st_is_valid(Parks_sf) == FALSE))
[1] 0

Primary school

length(which(st_is_valid(prisch_sf) == FALSE))
[1] 0

Top primary school

length(which(st_is_valid(top10prisch_sf) == FALSE))
[1] 0

Calculate Proximity

library(units)
udunits database from C:/Users/65917/AppData/Local/R/win-library/4.2/units/share/udunits/udunits2.xml
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}
# install.packages("matrixStats")
# library(matrixStats)
# proximity <- function(df1, df2, varname) {
#   dist_matrix <- st_distance(df1, df2) %>%
#     drop_units()
#   df1[,varname] <- rowMins(dist_matrix)
#   return(df1)
# }
# resale.sf <-
#   proximity(resale.sf, cbd_sf, "PROX_CBD") %>%
#   proximity(., childcaresf, "PROX_CHILDCARE") %>%
#   proximity(., Eldercare_sf, "PROX_ELDERCARE") %>%
#   proximity(., hawker_sf, "PROX_HAWKER") %>%
#   proximity(., rail_network_sf, "PROX_MRT") %>%
#   proximity(., Parks_sf, "PROX_PARK") %>%
#   proximity(., top10prisch_sf, "PROX_TOPPRISCH") %>%
#   proximity(., malls_sf, "PROX_MALL") %>%
#   proximity(., supermkt_sf, "PROX_SPRMKT")
# num_radius <- function(df1, df2, varname, radius) {
#   dist_matrix <- st_distance(df1, df2) %>%
#     drop_units() %>%
#     as.data.frame()
#   df1[,varname] <- rowSums(dist_matrix <= radius)
#   return(df1)
# }
# resale.sf <- 
#   num_radius(resale.sf, Kindergartens_sf, "NUM_KNDRGTN", 350) %>%
#   num_radius(., childcaresf, "NUM_CHILDCARE", 350) %>%
#   num_radius(., bus_sf, "NUM_BUS_STOP", 350) %>%
#   num_radius(., prisch_sf, "NUM_PriSch", 1000)
# resale.sf <- resale.sf %>%
#   mutate() %>%
#   rename("AREA_SQM" = "floor_area_sqm",
#          "LEASE_YRS" = "remaining_lease",
#          "PRICE"= "resale_price") %>%
#   relocate(`PRICE`)

The figure above reveals a right skewed distribution. This means that more condominium units were transacted at relative lower price, and we need to normalize it by using log transformation

# st_write(resale.sf, "data/geospatial/resale-final_cleaned.shp")
resale.sf <- st_read(dsn="data/geospatial", layer="resale-final_cleaned")
Reading layer `resale-final_cleaned' from data source 
  `C:\Quanfang777\IS415-GAA\Takehome_Exercise\Takehome_Exercise3\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 23656 features and 24 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
resale.sf <- resale.sf %>%
  rename( "AREA_SQM"= "AREA_SQ",
          "LEASE_YRS" = "LEASE_Y" , 
          "PROX_CBD" = "PROX_CB",
          "PROX_CHILDCARE" = "PROX_CH" ,
          "PROX_ELDERCARE" = "PROX_EL",
          "PROX_HAWKER"= "PROX_HA" ,
        "PROX_PARK"=  "PROX_PA",
          "PROX_MRT" ="PROX_MR",
          "PROX_SPRMKT"= "PROX_SP",
        "PROX_MALL"= "PROX_MA",
         "NUM_KNDRGTN"= "NUM_KND" ,
         "NUM_CHILDCARE"="NUM_CHI" ,
        "NUM_BUS_STOP"="NUM_BUS" ,
         "PROX_TOPPRISCH"= "PROX_TO"
      
      ) %>%
  relocate(`PRICE`)

Drawing Statistical Point Map

tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tm_shape(resale.sf) +  
  tm_dots(col = "PRICE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

From the statistical point map, we could observe that the southern and central area of the Singapore has a higher HDB resale price around 600k SGD

EDA

ggplot(data=resale.sf, aes(x=`PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a right skewed distribution. This means that more 4 room units were transacted at relative lower prices.Statistically, the skewed dsitribution can be normalised by using log transformation. We could see that most buildings seem to fall within 400k- 600k SGDdollars

resale.sf <- resale.sf %>%
  mutate(`LOG_PRICE` = log(PRICE))
ggplot(data=resale.sf, aes(x=`LOG_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

resale.sf <- resale.sf  %>%
  mutate(resale.sf, LEASE_YRS = as.integer(str_sub(LEASE_YRS, 0, 2)))

Multiple Histogram Plots distribution of variables

Let’s also look at the Multiple Histogram Plots distribution of variables

library(ggpubr)
AREA_SQM <- ggplot(data = resale.sf, aes(x = `AREA_SQM`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_CBD <- ggplot(data = resale.sf, aes(x = `PROX_CBD`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_CHILDCARE <- ggplot(data = resale.sf, aes(x = `PROX_CHILDCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_ELDERCARE <- ggplot(data = resale.sf, aes(x = `PROX_ELDERCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_HAWKER <- ggplot(data = resale.sf, aes(x = `PROX_HAWKER`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_MRT <- ggplot(data = resale.sf, aes(x = `PROX_MRT`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_PARK <- ggplot(data = resale.sf, aes(x = `PROX_PARK`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_TOPPRISCH <- ggplot(data = resale.sf, aes(x = `PROX_TOPPRISCH`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_MALL <- ggplot(data = resale.sf, aes(x = `PROX_MALL`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

PROX_SPRMKT <- ggplot(data = resale.sf, aes(x = `PROX_SPRMKT`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')


ggarrange(AREA_SQM, PROX_CBD, PROX_CHILDCARE, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_TOPPRISCH, PROX_MALL, PROX_SPRMKT, ncol = 2, nrow = 5)

NUM_CHILDCARE <- ggplot(data = resale.sf, aes(x = `NUM_CHILDCARE`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

NUM_KNDRGTN <- ggplot(data = resale.sf, aes(x = `NUM_KNDRGTN`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

NUM_BUS_STOP <- ggplot(data = resale.sf, aes(x = `NUM_BUS_STOP`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

LEASE_YRS <- ggplot(data = resale.sf, aes(x = `LEASE_YRS`)) + 
  geom_histogram(bins=20, color="black", fill = 'lightblue')

ggarrange(NUM_KNDRGTN, NUM_CHILDCARE, NUM_BUS_STOP,LEASE_YRS, ncol = 2, nrow = 2)

summary(resale.sf)
     PRICE            month               town             flt_typ         
 Min.   : 250000   Length:23656       Length:23656       Length:23656      
 1st Qu.: 440000   Class :character   Class :character   Class :character  
 Median : 490000   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 526124                                                           
 3rd Qu.: 568000                                                           
 Max.   :1370000                                                           
    block             strt_nm            stry_rn             AREA_SQM     
 Length:23656       Length:23656       Length:23656       Min.   : 70.00  
 Class :character   Class :character   Class :character   1st Qu.: 91.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 93.00  
                                                          Mean   : 94.67  
                                                          3rd Qu.:100.00  
                                                          Max.   :145.00  
   flt_mdl             ls_cmm_       LEASE_YRS        PROX_CBD      
 Length:23656       Min.   :1967   Min.   :44.00   Min.   :  999.4  
 Class :character   1st Qu.:1988   1st Qu.:65.00   1st Qu.: 9701.8  
 Mode  :character   Median :2001   Median :79.00   Median :13106.4  
                    Mean   :2001   Mean   :78.33   Mean   :12130.6  
                    3rd Qu.:2015   3rd Qu.:92.00   3rd Qu.:14885.0  
                    Max.   :2019   Max.   :97.00   Max.   :19650.1  
 PROX_CHILDCARE   PROX_ELDERCARE    PROX_HAWKER        PROX_MRT       
 Min.   :  0.00   Min.   :   0.0   Min.   :  30.6   Min.   :   9.112  
 1st Qu.: 64.35   1st Qu.: 318.6   1st Qu.: 396.5   1st Qu.: 243.245  
 Median :105.50   Median : 600.4   Median : 684.5   Median : 427.835  
 Mean   :112.23   Mean   : 780.3   Mean   : 796.3   Mean   : 514.828  
 3rd Qu.:153.82   3rd Qu.:1083.9   3rd Qu.:1025.1   3rd Qu.: 704.065  
 Max.   :547.39   Max.   :3301.6   Max.   :2867.6   Max.   :2111.909  
   PROX_PARK       PROX_TOPPRISCH      PROX_MALL        PROX_SPRMKT    
 Min.   :  45.42   Min.   :  79.58   Min.   :  43.45   Min.   :   0.0  
 1st Qu.: 490.12   1st Qu.:2314.17   1st Qu.: 394.59   1st Qu.: 162.1  
 Median : 703.40   Median :3661.82   Median : 604.26   Median : 251.6  
 Mean   : 796.82   Mean   :3650.32   Mean   : 710.02   Mean   : 274.4  
 3rd Qu.:1014.12   3rd Qu.:4857.56   3rd Qu.: 917.68   3rd Qu.: 362.4  
 Max.   :2411.72   Max.   :8763.99   Max.   :2668.57   Max.   :1571.3  
  NUM_KNDRGTN     NUM_CHILDCARE     NUM_BUS_STOP       NUM_PrS     
 Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.: 3.000   1st Qu.: 6.000   1st Qu.:2.000  
 Median :1.0000   Median : 5.000   Median : 8.000   Median :3.000  
 Mean   :0.9865   Mean   : 4.841   Mean   : 7.958   Mean   :3.285  
 3rd Qu.:1.0000   3rd Qu.: 6.000   3rd Qu.:10.000   3rd Qu.:4.000  
 Max.   :8.0000   Max.   :22.000   Max.   :19.000   Max.   :9.000  
          geometry       LOG_PRICE    
 POINT        :23656   Min.   :12.43  
 epsg:3414    :    0   1st Qu.:12.99  
 +proj=tmer...:    0   Median :13.10  
                       Mean   :13.15  
                       3rd Qu.:13.25  
                       Max.   :14.13  

Hedonic Pricing Modelling

Multiple Linear Regression Method

resale_nogeo <- resale.sf %>%
  st_drop_geometry() %>%
  select_if(is.numeric) 
corrplot::corrplot(cor(resale_nogeo[, 2:17]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

Before building a multiple regression model, it is important to ensure that the indepdent variables used are not highly correlated to each other. If these highly correlated independent variables are used in building a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.

From the scatterplot matrix , we could see that we have “lease_commence_date” highly corrlated to “Lease_Years” so we decided to remove “lease_commence_date” to prevent the issue of multicollinearity

Data Sampling

training_data<-resale.sf %>% 
  filter(month >= "2021-01" & month <= "2022-10")

testing_data<-resale.sf %>%
  filter(month >= "2022-10" & month <= "2022-12")

Building a non-spatial multiple linear regression

price_mlr <- lm(PRICE ~ AREA_SQM + PROX_CBD +  PROX_ELDERCARE + PROX_HAWKER+ PROX_MRT +  PROX_PARK + PROX_TOPPRISCH + PROX_MALL +PROX_SPRMKT +LEASE_YRS + NUM_CHILDCARE+NUM_KNDRGTN+ NUM_PrS,
                data=training_data)
write_rds(price_mlr, "data/ML_results/price_mlr.rds")
gtsummary::tbl_regression(price_mlr)
Characteristic Beta 95% CI1 p-value
AREA_SQM 3,070 2,915, 3,226 <0.001
PROX_CBD -18 -19, -18 <0.001
PROX_ELDERCARE -9.5 -11, -7.7 <0.001
PROX_HAWKER -25 -27, -23 <0.001
PROX_MRT -34 -37, -31 <0.001
PROX_PARK 9.3 6.7, 12 <0.001
PROX_TOPPRISCH -0.34 -0.97, 0.30 0.3
PROX_MALL 8.2 5.3, 11 <0.001
PROX_SPRMKT 13 6.1, 20 <0.001
LEASE_YRS 5,224 5,143, 5,306 <0.001
NUM_CHILDCARE -2,098 -2,642, -1,555 <0.001
NUM_KNDRGTN 10,654 9,505, 11,802 <0.001
NUM_PrS -8,889 -9,659, -8,120 <0.001
1 CI = Confidence Interval

Building Hedonic Pricing Models using GWmodel

Converting the sf data.frame to SpatialPointDataFrame

train_data_sp <- as_Spatial(training_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 21761 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 25
names       :   PRICE,   month,       town, flt_typ, block,       strt_nm,  stry_rn, AREA_SQM,       flt_mdl, ls_cmm_, LEASE_YRS,         PROX_CBD,   PROX_CHILDCARE,   PROX_ELDERCARE,      PROX_HAWKER, ... 
min values  :  250000, 2021-01, ANG MO KIO,  4 ROOM,     1,  ADMIRALTY DR, 01 TO 03,       70, Adjoined flat,    1967,        44, 999.393538715878, 1.4128908036e-05, 1.9894378743e-05, 30.6031805185926, ... 
max values  : 1370000, 2022-10,     YISHUN,  4 ROOM,    9B, YUNG SHENG RD, 49 TO 51,      145,       Type S1,    2019,        97, 19650.0691667807, 547.386819517238, 3301.63731686804, 2867.63031236184, ...